System and method of generating dictionary entries

ABSTRACT

A system for automatically generating a dictionary from full text articles extracts <term, definition> pairs from full text articles and stores the <term, definition> pairs as dictionary entries. The system includes a computer readable corpus having a plurality of documents therein. A pattern processing module (120) and a grammar processing module (125) are provided for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in a dictionary database (145). A routing processing module selectively routes sentences in the corpus to at least one of the pattern processing module or grammar processing module. In one embodiment, the routing module is incorporated into the pattern processing module which then selectively routes a portion of the sentences to the grammar processing module. A bootstrapping processing module (150) can be used to apply <term, definition> entries against the corpus to identify and extract additional <term, definition> entries.

RELATED APPLICATIONS

The present application claims the benefit of U.S. Provisional Application Ser. No. 60/324,880, entitled “Method for Identifying Definitions and their Technical Terms from on-line Text for Automatically Building a Glossary of Terms,” filed on Sep. 27, 2001, the disclosure of which is hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

The present invention relates generally to language processing and more particularly relates to a method for automatically generating dictionary entries using text analysis.

BACKGROUND OF THE INVENTION

The internet has enjoyed tremendous growth over recent years. As a result, vast quantities of information are readily available to millions of users. Among the vast content available on the internet are technical papers, which are of interest to a large number of people, but may be written for a technical audience and assume a baseline understanding of terms used in the particular field. Since now, more than ever, this assumption is not necessarily true, on-line dictionaries of technical terms are of growing importance.

On-line dictionaries of technical terms have been difficult to build and are often lacking in completeness. For example, technical dictionaries such as the Online Medical Dictionary (OMD, http://www.graylab.ac.uk/omd/index.html), are often missing common terms which are assumed to be understood by those practicing in the particular field. In addition, the definitions in such dictionaries are often too technical for use by a lay person. Accordingly, it would be desirable to automatically generate on-line glossaries of technical terms that are comprehensive and generally useful to the technically oriented user as well as the lay person.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a system for automatically generating dictionaries based on an analysis of full text articles.

It is another object of the present invention to provide a system for automatically generating dictionaries for various technical domains based on an analysis of full text articles.

It is a further object of the present invention to provide a system and method for extracting term-definition pairs from full text articles, the term-definition pairs being capable of use as dictionary entries.

In accordance with the invention, a computer-based method for automatically generating a dictionary based on a corpus of full text articles is provided. The method applies linguistic pattern analysis to the sentences in the corpus to extract simple <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs. Grammar analysis is then applied to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs. The extracted <term, definition> pairs are then stored in a dictionary database.

Pattern analysis can include identifying sentences having text markers and predetermined cue phrases and subjecting the identified sentences to rule based <term, definition> extraction. Sentences which include text markers can be further processed by a filtering operation to remove sentences which are not indicative of having <term, definition> pairs. Grammar processing generally operates upon sentences which include apposition or are in the form term is definition.

Also in accordance with the present invention is a system for automatically generating a dictionary from full text articles. The system includes a computer readable corpus having a plurality of documents therein. A pattern processing module and a grammar processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in a dictionary database are also provided. A routing processing module is provided to selectively route sentences in the corpus to at least one of the pattern processing module and grammar processing module.

Preferably, the system further includes a bootstrap processing module. The bootstrap processing module applies entries in the dictionary database to the corpus and extracts and stores additional <term, definition> pairs in the dictionary database.

In one embodiment, the routing processing module tags sentences which may include <term, definition> pairs and routes all tagged sentences to the pattern processing module. In addition to extracting certain <term, definition> pairs, the pattern processing module performs the additional operation of identifying sentences which include candidate complex <term, definition> pairs. The grammar processing module then receives the sentences having candidate complex <term, definition> pairs from the pattern processing module and operates to extract <term, definition> pairs from these sentences.

BRIEF DESCRIPTION OF THE DRAWING

Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which:

FIG. 1 is a flow diagram illustrating the overall operation of a system for generating entries for a dictionary by extracting term-definition pairs from full text articles;

FIG. 2 is a flow chart further illustrating the operation of an add linguistic markup block of FIG. 1;

FIGS. 3A and 3B are a flow chart illustrating the pattern analysis processing operation of FIG. 1 in greater detail;

FIG. 4 is a flow chart illustrating the grammar analysis processing operation of FIG. 1 in greater detail;

FIG. 5A is a pictorial representation of an example of a parse tree output from a grammar parsing program, such as English Slot Grammar;

FIG. 5B is a pictorial representation of an example of a parse tree output from a statistical grammar parsing program, such as Charniak's statistical parser;

FIG. 6 is a simplified flow chart illustrating an exemplary apposition processing subroutine suitable for use in connection with grammar analysis processing;

FIG. 7 is a simplified flow chart illustrating a subroutine for extracting <term, definition> tuples from sentences in the form TERM is DEFINITION, which is suitable for use in connection with grammar analysis processing;

FIG. 8 is a simplified flow chart further illustrating the steps performed by a bootstrapping algorithm, which is suitable for use in expanding the rule based dictionary database;

FIG. 9 is a flow diagram illustrating the overall operation of an alternate embodiment of a system for generating entries for a dictionary by extracting <term, definition> pairs from a corpus of full text articles;

FIG. 10 is a flow chart further illustrating the operation of a module selection logic block of FIG. 9; and

FIG. 11 is a flow chart illustrating the pattern analysis processing operation of FIG. 9 in greater detail.

Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 1 is a flow diagram illustrating the overall operation of a first embodiment of a system for generating entries for a dictionary by extracting <term, definition> pairs or tuples, from full text articles. While the system and methods of FIG. 1 will be described using the example of the medical field, the present invention can be applied to any body of text from which a user wishes to build a computer readable dictionary. The method of FIG. 1 is generally performed on a conventional computer system, which is operatively coupled to a source of full text articles. The articles can be stored in a local database or can be accessed over a digital communications network, such as the internet.

In step 105, the articles, in computer readable form, such as ASCII, HTML and the like, are input to the system. The articles are passed to a preprocessing algorithm, which tokenizes the input articles and formats the articles in a manner which is suitable for processing by text parser and part-of-speech tagging algorithms. The preprocessing operations of step 110 can include stripping away HTML tags, tokenizing the text and rewriting the text file as one sentence per line with each line numbered and identified to its source text. Tokenizing the text generally includes separating, by a space, each unit of a sentence (word, punctuation, number) that can be considered an independent token. Generally, hyphenated words are maintained as a single token.

Following preprocessing, an Add Linguistic Markup operation is performed which is used to identify sentences which may include <term, definition> tuples (step 115). The operation of the Add Linguistic Markup operation of step 115 is further illustrated in FIG. 2.

Referring to FIG. 2, adding linguistic mark ups begins by applying part-of-speech (POS) tagging to assign a part of speech label to each word (step 210). A number of known POS tagging programs can be used in connection with step 210. One suitable POS tagging program is described by E. Brill, “A Simple Rule-based Part of Speech Tagger,” Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, 1992, the disclosure of which is hereby incorporated by reference in its entirety.

Following POS tagging, a noun phrase identification operation, referred to in the art as Noun Phrase Chunking, is applied in step 215 to identify various forms of noun phrases in the sentences being processed. A suitable method for performing Noun Phrase Chunking is described by Ramshaw and Marcus in “Text Chunking Using Transformation-Based Learning,” Proceedings of Third ACL Workshop on Very Large Corpora, MIT, 1995, the disclosure of which is hereby incorporated by reference in its entirety. The Noun Phrase Chunking algorithm identifies noun phrases in the sentences being analyzed and sets the noun phrases apart from the other text, such as by insertion of brackets. For example, the sentence “Arrhythmia--irregular heartbeat--is experienced by millions of people in a variety of forms.” will be tagged as follows:

[Arrhythmia/NNP]--/: [irregular/JJ heartbeat/NN]--/: is/VBZ experienced/VBN by/IN [millions/NNS] of/IN [people/NNS] in/IN [a/DT variety/NN] of/IN forms/NNS;

where NNP represents a proper noun, JJ represents an adjective, NN represents a common noun, VBZ is a verb, present tense, singular, VBN is a verb, past participle, IN means preposition, DT refers to determiner and NNS refers to a common noun, plural.

After Noun Phrase Chunking, the input sentences are evaluated to determine whether they include either text markers or cue phrases (step 220). Text markers that have been found to be indicative of a definition include hyphens, such as --, and parenthetical expressions following noun phrases. Cue phrases which are indicative of a definition include forms of “X is the term used to describe Y,” as opposed to phrases such as “for example” or “such as” which tend to indicate explanation rather than definition. Those sentences which are found to include cue phrases and/or text markers in step 220 are marked with tags in step 225. In step 240, the sentences marked in step 225 are routed to a pattern analysis module 120 which performs shallow parsing of the sentence to extract simple <term, definition> tuples and also identifies candidate sentences which may include complex definitions.

If in step 220 the sentence being evaluated does not include text markers, the sentence is further evaluated to determine whether the sentence includes possible anaphora or represents a form of term is definition (step 230). Such sentences are marked with tags indicating the possible <term, definition> tuple in step 235 and the marked sentences are routed to the pattern analysis module (step 240). If the test performed in step 230 fails, the sentence is rejected as not including a <term, definition> tuple.
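The marking decisions of steps 220 through 240 can be sketched as follows. This is a simplified illustration, not the described implementation: the tag names, the regular expressions and the function `mark_sentence` are hypothetical, the cue phrase list is only the non-exhaustive set named in this description, and the anaphora test of step 230 is not modeled (only a rough "term is definition" check is shown).

```python
import re

# Cue phrases named in the description (non-exhaustive).
CUE_PHRASES = ("is the term used to describe", "is defined as", "is called")
# Text markers: a double hyphen or an opening parenthesis (assumed encoding).
TEXT_MARKER = re.compile(r"--|\(")
# Rough stand-in for the "term is definition" test of step 230 (assumption).
IS_FORM = re.compile(r"\b(is|are)\b")

def mark_sentence(sentence):
    """Return a set of tags for the sentence, or None when it is rejected."""
    tags = set()
    if TEXT_MARKER.search(sentence):                   # step 220
        tags.add("TEXT_MARKER")
    if any(cue in sentence for cue in CUE_PHRASES):    # step 220
        tags.add("CUE_PHRASE")
    if tags:
        return tags                                    # step 225
    if IS_FORM.search(sentence):                       # step 230
        return {"T_IS_D"}                              # step 235
    return None                                        # rejected
```

Sentences for which `mark_sentence` returns a tag set would be the ones routed on to pattern analysis; a result of `None` corresponds to rejection.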

The sentences that are marked in steps 225 and 235 are passed to the pattern analysis module which is further described in connection with FIGS. 3A and 3B. In step 310, the sentences are evaluated for the presence of text markers. Those sentences which include text markers are then subjected to a number of filtering rules in step 315 (FIG. 3B) which are designed to eliminate sentences which are not likely to possess <term, definition> tuples. The filtering rules generally will remove sentences which have conjunctions at the beginning of a text marker. In addition, the filtering rules will identify and remove phrases that indicate explanation rather than definition, such as “for example,” “for instance” and the like. The filtering rules can also identify and eliminate sentences having lists of commas and conjunctions, which indicate enumeration and have not been found to identify <term, definition> pairs. It will be appreciated that additional rules may be found to be useful in identifying and eliminating phrases set off by text markers which are not indicative of term-definition pairs and that such rules could also be implemented in step 315.
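A minimal sketch of the three filtering rules of step 315 is given below. The function name `passes_filters`, the conjunction list, the choice to inspect all text after the first marker, and the threshold of two commas for enumeration are assumptions made for illustration; the description does not fix any of them.

```python
import re

EXPLANATION_CUES = ("for example", "for instance")
CONJUNCTIONS = ("and", "or", "but")  # assumed list

def passes_filters(sentence, marker="--"):
    """Return False when the text after the first marker looks like a non-definition."""
    _, found, after = sentence.partition(marker)
    if not found:
        return True  # no text marker, so these rules do not apply
    after = after.strip().lower()
    # Rule 1: a conjunction at the beginning of the marked phrase.
    first_word = after.split()[0] if after else ""
    if first_word in CONJUNCTIONS:
        return False
    # Rule 2: the marked phrase opens with an explanation cue.
    if any(after.startswith(cue) for cue in EXPLANATION_CUES):
        return False
    # Rule 3: several commas plus a conjunction suggest enumeration.
    words = re.findall(r"[a-z]+", after)
    if after.count(",") >= 2 and any(w in CONJUNCTIONS for w in words):
        return False
    return True
```

Additional rules of the kind the paragraph anticipates could be added as further early-return checks in the same function.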

Following the filtering operations of step 315, the remaining sentences including text markers are analyzed to identify simple noun phrase patterns in the form: Noun Phrase 1 {text marker} Noun Phrase 2 (step 320). For those sentences which include such noun phrase patterns, which represent a <term, definition> pair, the term and definition need to be identified in step 325. One method of identifying the term and definition components of the <term, definition> tuple is to determine the frequency of occurrence of Noun Phrase 1 and Noun Phrase 2. The noun phrase having the higher frequency of occurrence is designated the term and the other noun phrase is considered the definition for the term (step 325). The <term, definition> tuples can be used to form a hash array where the terms are keys to the array and the definitions are the values of the array. The hash array is then added to the pattern dictionary database 130.
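The frequency heuristic of step 325 might be sketched as shown here. The function `assign_roles`, the case-insensitive substring counting and the tie-breaking rule (the first noun phrase is taken as the term when counts are equal) are assumptions for this example; a Python `dict` stands in for the hash array.

```python
from collections import Counter

def assign_roles(noun_phrase_1, noun_phrase_2, corpus_sentences):
    """Return (term, definition): the more frequent phrase is taken as the term."""
    counts = Counter()
    for sentence in corpus_sentences:
        lowered = sentence.lower()
        for phrase in (noun_phrase_1, noun_phrase_2):
            # Case-insensitive substring count (a simplifying assumption).
            counts[phrase] += lowered.count(phrase.lower())
    if counts[noun_phrase_2] > counts[noun_phrase_1]:
        return noun_phrase_2, noun_phrase_1
    return noun_phrase_1, noun_phrase_2  # ties keep the original order

corpus = [
    "Arrhythmia -- irregular heartbeat -- is common .",
    "Arrhythmia can be treated .",
    "Some arrhythmia is harmless .",
]
pattern_dictionary = {}  # terms are keys, definitions are values
term, definition = assign_roles("irregular heartbeat", "Arrhythmia", corpus)
pattern_dictionary[term] = definition
```

With this toy corpus, "Arrhythmia" occurs more often than "irregular heartbeat", so it is stored as the key of the dictionary entry.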

If in step 320, the sentence does not represent a simple noun phrase pattern, the sentence is further evaluated to determine whether the sentence may include a complex definition and be a candidate for further grammar processing (step 330). Sentences of various forms having syntactic structures more complex than Noun Phrase 1 {text marker} Noun Phrase 2 can be candidates for complex definitions. For example, sentences having a form Noun Phrase 1 {text marker} Noun Phrase 2 (.*), where (.*) represents any additional text, can be identified as candidates which may include complex definitions. Sentences which include possible complex definitions are stored in a hash array for subsequent grammar processing which will be described below in connection with FIG. 4 (step 335). If in step 330, the sentences do not include candidates for complex definitions, the sentences are removed from further processing.

Returning to FIG. 3A, in step 310, those sentences which do not include text markers are passed to step 340 which evaluates the input sentences to identify cue phrases within the sentences. A non-exhaustive list of cue phrases includes: “is the term used to describe”, “is defined as”, “is called” and the like. Sentences including cue phrases are parsed by identifying the context on the left hand side (LHS) of the cue phrase and the context on the right hand side (RHS) of the cue phrase (step 345). If the left hand side is a noun phrase, then the LHS noun phrase will be considered a term and the right hand side will be considered a definition for the term (step 350). The <term, definition> pair can be used to generate a hash array and added to the pattern dictionary database 130.
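The cue phrase handling of steps 340 through 350 can be illustrated with the sketch below. The function `split_on_cue` and the caller-supplied predicate `is_noun_phrase` are hypothetical; in the described system the noun phrase test would rely on the chunker output rather than on a callback.

```python
CUE_PHRASES = ("is the term used to describe", "is defined as", "is called")

def split_on_cue(sentence, is_noun_phrase):
    """Return (term, definition) if a cue phrase has a noun phrase on its left."""
    for cue in CUE_PHRASES:
        # Split into left hand side and right hand side around the cue phrase.
        lhs, found, rhs = sentence.partition(f" {cue} ")
        if found and is_noun_phrase(lhs.strip()):
            # Drop the trailing sentence period from the definition.
            return lhs.strip(), rhs.strip().rstrip(" .")
    return None

def short_phrase(text):
    """Toy stand-in for a chunker: treat up to three words as a noun phrase."""
    return 0 < len(text.split()) <= 3

pair = split_on_cue("Tachycardia is defined as a rapid heart rate .", short_phrase)
```

Here `pair` holds the left hand side as the term and the right hand side as its definition, ready to be stored as a key and value.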

Those sentences which do not include cue phrases in step 340 are further evaluated to determine if they include simple patterns which have been found to be representative of simple <term, definition> tuples (step 355). Such simple patterns include {Noun Phrase is Noun Phrase} and {Noun Phrase, or (optional) Noun Phrase, |}. For example, “myocardial infarction, heart attack . . . ” and “myocardial infarction, or heart attack . . . ” illustrate such patterns. <Term, definition> tuples are extracted from sentences having the simple patterns tested for in step 355 and the extracted <term, definition> tuples are added to the pattern dictionary 130.

Those sentences in step 355 which do not include simple patterns are further analyzed to determine if the sentences may include complex definitions (step 360). As in the case of sentences having text markers, sentences of various forms having syntactic structures more complex than Noun Phrase 1 is Noun Phrase 2 can be candidates for complex definitions. If a sentence is identified as including a candidate complex definition, the sentence is added to a hash table and is stored for additional processing (step 370). Otherwise, the sentence is discarded from additional processing.

Returning to FIG. 1, the sentences that are placed in the hash array as including candidates for complex definitions are passed from the pattern analysis processing block 120 to the grammar analysis processing block 125 to determine if this subset of sentences includes <term, definition> tuples. The operation of the grammar analysis block 125 is further illustrated in the flow diagram of FIG. 4.

Referring to FIG. 4, grammar analysis processing begins with a grammar parsing operation (step 405). Grammar parsing can be performed with known parsing programs such as the English Slot Grammar (ESG) parser program, available from International Business Machines and described in “The Slot Grammar System,” by McCord, IBM Research Report, 1991, or a statistical parser, such as that described by E. Charniak in “A Maximum Entropy Inspired Parser,” Proceedings of NAACL 2002, the disclosure of which is hereby incorporated by reference. These parser programs are used to parse each input file and obtain a parsed form for each sentence (step 405). For each sentence, the ASCII-style output which is generally provided by the parser program is formatted as a parsed data structure that provides the dependency between the words in the sentences, the hash array that contains the feature structures associated with each word, which are the nodes in the tree, and the slot filler type of each node (step 410).

A typical parsed tree structure based on the ESG parser is illustrated, for example, in FIG. 5A. In FIG. 5A, the first column 505 indicates the slot filled by the node (word), the second column 510 lists the words and positions of the words, the third column 520 lists the feature structure for each word. FIG. 5B illustrates an example of parsed output using Charniak's statistical parser.

Returning to FIG. 4, the parsed tree is traversed to determine the daughters and ancestors of each node (step 415). The sentences are then evaluated to determine if they include apposition (step 420) and if they do, processing is passed to apposition processing logic in step 425. If in step 420 the sentences do not include apposition, the sentences are evaluated to determine if the sentences are of the form term is definition (<T is D>). For sentences in the form of <T is D>, <T is D> processing block 435 is called to identify and extract term-definition pairs.

FIG. 6 is a flow chart illustrating an example of logic which can be used to process sentences including apposition in order to extract <term, definition> pairs. In step 605 the tree structure is evaluated to determine whether the apposition word (‘,’ ‘--’ or ‘(’) has ancestors. If so, then the left conjunction (lconj) is identified and evaluated to determine if the lconj is a noun (step 610). Similarly, the right conjunction (rconj) is evaluated to determine if it is a noun (step 615). The right conjunction is also tested to determine whether it includes an additional comma or conjunction, which would be indicative of enumeration rather than definition (step 620). If the lconj=noun, rconj=noun and rconj does not indicate enumeration, the subtree rooted at lconj is labeled as the term, the subtree rooted at rconj is labeled as the definition and the term-definition pair is added to the hash tree (step 625) as an entry for the grammar dictionary 135. If any of the decision blocks 605, 610, 615 or 620 fail, the sentence is considered as not including a <term, definition> tuple (block 630).
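The decision blocks of FIG. 6 can be sketched over a toy dependency tree as follows. The `Node` class, its field names, the coarse part-of-speech labels and the helper functions are hypothetical stand-ins for the parsed data structure; they do not reproduce the output format of ESG or any other parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    word: str
    pos: str                     # coarse label, e.g. "noun", "verb", "punct"
    position: int                # word order in the sentence
    slot: str = ""               # e.g. "lconj", "rconj"
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

def subtree_text(node):
    """Words of the subtree rooted at node, in sentence order."""
    items, stack = [], [node]
    while stack:
        current = stack.pop()
        items.append((current.position, current.word))
        stack.extend(current.children)
    return " ".join(word for _, word in sorted(items))

def has_enumeration(node):
    """True if the subtree below node holds a further comma or conjunction."""
    stack = list(node.children)
    while stack:
        current = stack.pop()
        if current.word in {",", "and", "or"}:
            return True
        stack.extend(current.children)
    return False

def extract_apposition(marker):
    """Return (term, definition) or None, following blocks 605 to 630."""
    if marker.parent is None:                                   # step 605
        return None
    lconj = next((c for c in marker.children if c.slot == "lconj"), None)
    rconj = next((c for c in marker.children if c.slot == "rconj"), None)
    if lconj is None or lconj.pos != "noun":                    # step 610
        return None
    if rconj is None or rconj.pos != "noun":                    # step 615
        return None
    if has_enumeration(rconj):                                  # step 620
        return None
    return subtree_text(lconj), subtree_text(rconj)             # step 625
```

A returned pair would be added to the grammar dictionary; `None` corresponds to block 630.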

FIG. 7 is a flow chart illustrating an example of logic which can be used to process sentences in the form term is definition, <T is D>, to extract <term, definition> pairs. In step 705, the parsed tree of the sentence is evaluated to determine if the root of the tree is a present tense form of the verb “to be.” If so, then in step 710 the subject of the sentence is evaluated to determine if the subject is a noun. If the subject is a noun, then flow proceeds to step 715 where the daughters of the subject are evaluated to determine whether the daughters are left conjunctions (lconj) and right conjunctions (rconj). If in step 715 the daughters are not lconj, rconj, then the predicate in the sentence is tested in step 720 to determine if the predicate is a noun. If the predicate is a noun, then the subject is identified as the term and the subtree rooted at the predicate is identified as the definition (step 725). The <term, definition> pair is then added to the hash table in the grammar dictionary 135.

Returning to step 715, if the daughters of the subject are in the form lconj, rconj, then the lconj and rconj terms are evaluated to determine if they are nouns, which indicates that these daughters are possible synonyms of the subject. If the lconj and rconj are nouns, then each of these daughters is labeled as an additional term (step 745). The predicate of the sentence is then tested to determine if it is a noun in step 750. If the predicate is a noun, then the subtree rooted at the predicate is labeled as the definition for each of the terms. The <term, definition> pairs are then added to the hash array which are then entered into the grammar dictionary 135. If any of the conditional tests of steps 705, 710, 720, 740, or 750 fail, then the sentence is considered as not including a term-definition pair.
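A simplified sketch of the <T is D> logic of FIG. 7 follows. Nodes are plain dictionaries with hypothetical keys (`word`, `pos`, `slot`, `text`, `children`), where `text` is assumed to hold the already-rendered subtree. Restricting the verb test to "is" and "are", and treating the two conjuncts as the terms in the conjoined case, are interpretive assumptions.

```python
BE_PRESENT = {"is", "are"}  # assumed set of present tense forms of "to be"

def child_with_slot(node, slot):
    """First daughter of node filling the given slot, or None."""
    return next((c for c in node.get("children", []) if c["slot"] == slot), None)

def extract_t_is_d(root):
    """Return a list of (term, definition) pairs; empty when any test fails."""
    if root["word"] not in BE_PRESENT:                       # step 705
        return []
    subject = child_with_slot(root, "subj")
    predicate = child_with_slot(root, "pred")
    if subject is None or subject["pos"] != "noun":          # step 710
        return []
    if predicate is None or predicate["pos"] != "noun":      # steps 720, 750
        return []
    lconj = child_with_slot(subject, "lconj")                # step 715
    rconj = child_with_slot(subject, "rconj")
    if lconj is not None and rconj is not None:
        if lconj["pos"] != "noun" or rconj["pos"] != "noun":
            return []
        terms = [lconj["text"], rconj["text"]]               # step 745
    else:
        terms = [subject["text"]]                            # step 725
    # The predicate subtree is the definition for every term found.
    return [(term, predicate["text"]) for term in terms]
```

Each returned pair would be one entry for the grammar dictionary; an empty list means the sentence is treated as containing no term-definition pair.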

Referring to FIG. 1, the merged dictionary database 145 can be further expanded with additional <term, definition> pairs by applying a bootstrapping algorithm in step 150, such as is described in further detail in FIG. 8.

Referring to FIG. 8, the <term, definition> pairs of the merged dictionary database 145 are used as seed tuples in the initialization step (step 805). Then, the terms are identified in the corpora of full text articles (step 810) and the left hand side context and right hand side context of each term are identified inside the sentence where the term is identified (step 815). A matching algorithm is then used to identify correspondence between the definition and the left hand side and right hand side context, respectively (step 820). To increase flexibility, word order is not considered by the algorithm. When a definition is subsumed by either the left side or right side context or overlaps either context sufficiently, the context is considered as a match to the definition and the exact match is called Def.
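The order-independent matching of step 820 might look like the sketch below. The function names and the overlap threshold of 0.8 are assumptions; the description only requires that the definition be subsumed by, or overlap "sufficiently" with, one of the contexts.

```python
def context_matches(definition, context, threshold=0.8):
    """Compare word sets, ignoring word order; threshold is an assumed value."""
    def_words = set(definition.lower().split())
    ctx_words = set(context.lower().split())
    if not def_words:
        return False
    # Fraction of the definition's words that appear in the context.
    overlap = len(def_words & ctx_words) / len(def_words)
    return overlap >= threshold

def find_definition_context(sentence, term, definition):
    """Return "left" or "right" for the context matching the definition, else None."""
    lhs, found, rhs = sentence.partition(term)      # steps 810 and 815
    if not found:
        return None
    for side, context in (("left", lhs), ("right", rhs)):
        if context_matches(definition, context):    # step 820
            return side
    return None
```

Full subsumption corresponds to an overlap of 1.0, so it is covered by the same comparison as partial overlap.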

After matching is performed, a new tuple of the form <l,T,m,Def,r> is recorded that keeps the left (l), the middle (m) and the right (r) context inside the sentence (step 825). These contexts are in connection with both term and definition. An example of middle context is: (<term> is characterized by <definition>). It has been observed that the middle context can be more important than right or left contexts. Thus it may be preferable to assign more weight to m.

The sentence from the corpus that contains the <term, definition> pair is parsed using a statistical parser, such as that described by E. Charniak in “A Maximum Entropy Inspired Parser,” Proceedings of NAACL 2002, the disclosure of which is hereby incorporated by reference. The statistical parser can be used to generate candidate patterns for identifying additional <term, definition> tuples in the corpus in the following iterative steps.

Candidate patterns are identified from the parse tree (step 830) by performing a matching algorithm on the parse tree of the sentences from the corpus and the initial tuple <term, definition> parse tree. The subtrees are matched to a corresponding T and Def and the tree pattern covering these two subtrees is recorded.

A best rank score can then be used to select new patterns which identify <term, definition> tuples in the corpora. The rank score is similar to the RlogF measure used in AutoSlog-TS, described by E. Riloff in the paper “Automatically Generating Extraction Patterns from Untagged Text,” Proceedings of AAAI 1996. The rank score is computed as: score(pattern) = R*log₂(F), where F is the number of unique good <term, definition> tuples the pattern extracts, N is the total number of tuples (good and bad) and R=F/N. A pattern can extract both <term, definition> pairs as well as <term, non-definition> pairs. The first is considered a good tuple. The latter form, which is not a true <term, definition> pair, is considered a bad tuple.
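The rank score can be computed directly from the definitions of F, N and R given above. In this short sketch the function name `rank_score` is hypothetical, and returning 0.0 when a pattern extracts no good tuples (where the logarithm is undefined) is an assumption.

```python
import math

def rank_score(good_tuples, total_tuples):
    """score(pattern) = R * log2(F), with R = F / N."""
    f = len(set(good_tuples))        # F: number of unique good tuples
    if f == 0 or total_tuples == 0:
        return 0.0                   # assumed convention for the undefined case
    r = f / total_tuples             # R = F / N
    return r * math.log2(f)

good = [("angina", "chest pain"), ("arrhythmia", "irregular heartbeat"),
        ("tachycardia", "rapid heart rate"), ("bradycardia", "slow heart rate")]
score = rank_score(good, total_tuples=8)  # F = 4, N = 8, R = 0.5
```

With F = 4 and N = 8, the score is 0.5 * log₂(4) = 1.0. A pattern that extracts a single good tuple scores 0 because log₂(1) = 0, so the measure favors patterns that are both precise and productive.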

Examples of new patterns extracted using bootstrapping are set forth in the table below.

Examples of Patterns Identified by Bootstrapping Algorithm

<term> is characterized by <definition>
<term>, in which <definition>
<term> - in which <definition>
<term>, which is <definition>
<term> is used to <definition>
<term> occurs when <definition>
In <term>, <definition>

After new patterns are identified, they are applied to the corpus of full text articles (step 840) and new <term, definition> tuples are identified (step 850). These tuples are added to the temporary bootstrap dictionary (step 860) and then additional iterations can be performed a fixed number of times, such as three times, or until fewer than a predetermined number of new tuples is identified (step 870).

After the iteration process ends, the original dictionary is then merged with the bootstrap dictionary to produce the final output dictionary (step 880).
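The iteration and merge of steps 840 through 880 can be outlined as follows. The callables `learn_patterns` and `apply_patterns` are hypothetical placeholders for the pattern induction and pattern application stages, and letting original entries take precedence over bootstrapped entries during the merge is an assumption.

```python
def bootstrap(seed_dictionary, corpus, learn_patterns, apply_patterns,
              max_iterations=3, min_new_tuples=1):
    """Grow a temporary bootstrap dictionary, then merge it with the seeds."""
    bootstrap_dictionary = {}
    known = dict(seed_dictionary)
    for _ in range(max_iterations):                     # fixed iteration cap
        patterns = learn_patterns(known, corpus)
        candidates = apply_patterns(patterns, corpus)   # steps 840 and 850
        new = {t: d for t, d in candidates.items() if t not in known}
        bootstrap_dictionary.update(new)                # step 860
        known.update(new)
        if len(new) < min_new_tuples:                   # step 870
            break
    merged = dict(bootstrap_dictionary)                 # step 880
    merged.update(seed_dictionary)   # original entries win on conflict (assumption)
    return merged
```

The loop ends either after the fixed number of passes or as soon as a pass yields fewer new tuples than the chosen minimum, mirroring the two stopping conditions described above.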

An alternate embodiment of the present systems and methods for extracting <term, definition> pairs from full text sources is illustrated in FIGS. 9 through 11. The system of FIG. 9 is similar to the architecture depicted in FIG. 1 except that instead of routing all sentences through the shallow parsing block 120 of FIG. 1, a module selection logic block 915 is used to route the sentences to either one or both of a pattern analysis processing block 920 and a grammar analysis processing block 925. Thus the embodiment of FIG. 9 replaces the linguistic mark-up block 115 with the module selection logic block 915 and modifications to the shallow parsing block are also provided to account for this change.

Except as noted below, processing blocks 905, 910, 925, 930, 935, 940, 945 and 950 are substantially the same as processing blocks 105, 110, 125, 130, 135, 140, 145 and 150, respectively, which are described above in connection with FIG. 1.

In step 905, the articles, in computer readable form, such as ASCII, HTML and the like, are input to the system. The articles are passed to a preprocessing algorithm 910, which tokenizes the input articles and formats the articles in a manner which is suitable for processing by text parser and part-of-speech tagging algorithms. The preprocessing operations of step 910 can include stripping away HTML tags other than those which are emphasis tags, such as <EM> and <B>, tokenizing the text and rewriting the text file as one sentence per line with each line numbered and identified to its source text. Tokenizing the text generally includes separating, by a space, each unit of a sentence (word, punctuation, number) that can be considered an independent token. Generally, hyphenated words are kept as a single token.

Following preprocessing of step 910, the system performs a module selection operation in module selection logic block 915. The module selection block analyzes the input articles to route the text to either a pattern analysis processing block 920, a grammar analysis processing block 925 or both processing blocks. The pattern analysis processing block 920 extracts term-definition pairs from the sentences routed to this processing block by the module selection logic 915 and places the term-definition pairs in a pattern dictionary database 930. The operation of the pattern analysis processing block 920 is further described below in connection with FIG. 11.

Referring to FIG. 10, the operation of the module selection logic block 915 is further described. In the module selection logic, the input sentences are evaluated to determine if they are emphasized sentences (step 1010). In the case of HTML formatted files, emphasized sentences are identified by the presence of emphasis tags, such as <EM> and <B>. Those sentences which are identified as emphasized are passed to both the pattern analysis processing block 920 and the grammar analysis processing block 925 in step 1015 to extract <term, definition> tuples.

Those sentences from the input articles which are not emphasized sentences are further evaluated to determine whether the sentences include text markers which are indicative of the presence of a definition or cue phrases which are indicative of the presence of a definition (step 1020). Text markers that have been found to be indicative of a definition include hyphens, such as --, and parenthetical expressions following noun phrases. Cue phrases which are indicative of a definition include forms of “X is the term used to describe Y,” as opposed to phrases such as “for example” or “such as” which tend to indicate explanation rather than definition. Those sentences which are found to include cue phrases and/or text markers are passed to the pattern analysis block 920 in step 1025 which performs shallow parsing of the sentence to extract simple <term, definition> tuples.

Those sentences which do not include text markers or cue phrases in step 1020 are passed to the grammar analysis block 925 for full parsing and grammar analysis (step 1030).
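The three-way routing of FIG. 10 can be summarized in a few lines. In this sketch the function `select_modules`, the returned labels and the regular expressions are hypothetical, and the cue phrase list is the non-exhaustive set given earlier in this description.

```python
import re

EMPHASIS = re.compile(r"<\s*(em|b)\s*>", re.IGNORECASE)  # <EM> or <B> tags
TEXT_MARKER = re.compile(r"--|\(")                        # assumed marker encoding
CUE_PHRASES = ("is the term used to describe", "is defined as", "is called")

def select_modules(sentence):
    """Return the set of processing blocks a sentence is routed to."""
    if EMPHASIS.search(sentence):                          # steps 1010 and 1015
        return {"pattern", "grammar"}
    has_cue = any(cue in sentence for cue in CUE_PHRASES)
    if TEXT_MARKER.search(sentence) or has_cue:            # steps 1020 and 1025
        return {"pattern"}
    return {"grammar"}                                     # step 1030
```

Unlike the first embodiment, no sentence is rejected at this stage: anything without emphasis, text markers or cue phrases falls through to full grammar analysis.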

FIG. 11 is a simplified flow chart illustrating an embodiment of thelogical flow used in the pattern analysis (shallow parsing) block 920.For each file that is passed to the pattern analysis processing block,part-of-speech (POS) tagging is applied to assign a part of speech labelto each word (step 1110). A number of known POS tagging programs can beused in connection with step 1110. One suitable POS tagging program isdescribed by E. Brill, “A Simple Rule-based Part of Speech Tagger,”Proceedings of the Third Conference on Applied Natural LanguageProcessing, Trento, Italy, 1992, the disclosure of which is herebyincorporated by reference in its entirety.

Following POS tagging, a noun phrase identification operation, referred to in the art as Noun Phrase Chunking, is applied in step 1120 to identify various forms of noun phrases in the sentences being processed. A suitable method for performing Noun Phrase Chunking is described by Ramshaw and Marcus in “Text Chunking Using Transformation-Based Learning,” Proceedings of Third ACL Workshop on Very Large Corpora, MIT, 1995, the disclosure of which is hereby incorporated by reference in its entirety. The Noun Phrase Chunking algorithm identifies noun phrases in the sentences being analyzed and sets the noun phrases apart from the other text, such as by insertion of brackets. For example, the sentence “Arrhythmia—irregular heartbeat—is experienced by millions of people in a variety of forms.” will be tagged as follows:

-   [Arrhythmia/NNP]--/: [irregular/JJ heartbeat/NN]--/: is/VBZ experienced/VBN by/IN [millions/NNS] of/IN [people/NNS] in/IN [a/DT variety/NN] of/IN forms/NNS;

where NNP represents a proper noun, JJ represents an adjective, NN represents a common noun, VBZ represents a verb, present tense, singular, VBN represents a verb, past participle, IN represents a preposition, DT represents a determiner and NNS represents a common noun, plural.

Following the Noun Phrase Chunking operation of step 1120, for each sentence that contains text markers in step 1130, a set of filtering rules will be applied in step 1140 to remove sentences that include misleading patterns which have been found not to be indicative of term-definition pairs. The filtering rules generally will remove sentences which have conjunctions at the beginning of a text marker. In addition, the filtering rules will identify and remove phrases that indicate explanation rather than definition, such as “for example,” “for instance” and the like. The filtering rules can also identify and eliminate sentences that have a series of commas and conjunctions, which indicate enumeration and have not been found to identify term-definition pairs. It will be appreciated that additional rules may be found to be useful in identifying and eliminating phrases set off by text markers which are not indicative of term-definition pairs and that such rules could also be implemented in step 1140.
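By way of illustration, the three filtering rules of step 1140 might be approximated as shown below. The function name `keep_sentence`, the conjunction list, the set of recognized markers and the two-comma threshold for enumeration are all assumptions chosen for this sketch; the description does not specify them.

```python
import re

# Assumed lists; the description names only "for example" and "for instance".
CONJUNCTIONS = {"and", "or", "but", "nor"}
EXPLANATION_PHRASES = ("for example", "for instance")

# Text set off by double hyphens or by parentheses (assumed marker set).
MARKED_SPAN = re.compile(r"--\s*(.*?)\s*--|\(([^)]*)\)")


def keep_sentence(sentence: str) -> bool:
    """Return False when a filtering rule marks the sentence as misleading."""
    lowered = sentence.lower()

    # Rule: phrases that indicate explanation rather than definition.
    if any(phrase in lowered for phrase in EXPLANATION_PHRASES):
        return False

    # Rule: a conjunction at the beginning of the text set off by a marker.
    for match in MARKED_SPAN.finditer(lowered):
        span = match.group(1) or match.group(2) or ""
        words = span.split()
        if words and words[0] in CONJUNCTIONS:
            return False

    # Rule: a series of commas plus a conjunction suggests enumeration.
    # The threshold of two commas is an assumption of this sketch.
    words = re.findall(r"[a-z]+", lowered)
    if lowered.count(",") >= 2 and any(word in CONJUNCTIONS for word in words):
        return False

    return True
```

The sketch operates on raw text for brevity; an embodiment working on the POS-tagged and chunked output of steps 1110 and 1120 could test tags rather than word lists.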

Following the filtering operations of step 1140, the remaining sentences including text markers are analyzed to identify noun phrase patterns in the form: Noun Phrase 1 {text marker} Noun Phrase 2 {text marker or .} (step 1160). In this pattern, either Noun Phrase 1 or Noun Phrase 2 may be the term or the definition. To identify the term and definition, for each such noun phrase pattern which represents a term-definition pair, the frequencies of occurrence of Noun Phrase 1 and Noun Phrase 2 are determined. The noun phrase having the higher frequency of occurrence is designated the term and the other noun phrase is considered the definition for the term (step 1165). The <term, definition> tuples can be used to form a hash array where the terms are keys to the array and the definitions are the values of the array. The hash array is then added to the pattern dictionary database 930.
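The frequency comparison of step 1165 and the resulting hash array can be sketched as follows. The helper names, the use of case-insensitive substring counting over the corpus sentences as the frequency measure, and the tie-breaking in favor of Noun Phrase 1 are assumptions of this example rather than details given in the description.

```python
from typing import Dict, Iterable, List, Tuple


def count_phrase(phrase: str, corpus_sentences: Iterable[str]) -> int:
    """Count case-insensitive occurrences of a phrase (assumed frequency measure)."""
    needle = phrase.lower()
    return sum(sentence.lower().count(needle) for sentence in corpus_sentences)


def build_pattern_dictionary(
    noun_phrase_pairs: Iterable[Tuple[str, str]],
    corpus_sentences: List[str],
) -> Dict[str, str]:
    """Map each term to its definition, using frequency to decide which is which."""
    dictionary: Dict[str, str] = {}
    for np1, np2 in noun_phrase_pairs:
        # The more frequent noun phrase is designated the term (step 1165).
        # Ties go to Noun Phrase 1, which is an assumption of this sketch.
        if count_phrase(np1, corpus_sentences) >= count_phrase(np2, corpus_sentences):
            term, definition = np1, np2
        else:
            term, definition = np2, np1
        dictionary[term] = definition
    return dictionary


corpus = [
    "Arrhythmia -- irregular heartbeat -- is experienced by millions of people.",
    "Arrhythmia can often be treated.",
    "Patients with arrhythmia should consult a physician.",
]
entries = build_pattern_dictionary([("irregular heartbeat", "Arrhythmia")], corpus)
```

Here a Python dict stands in for the hash array, with terms as keys and definitions as values.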

Returning to step 1130, those sentences which do not include text markers are passed to step 1170 which evaluates the input sentences to identify cue phrases within the sentences. A non-exhaustive list of cue phrases includes: “is the term used to describe”, “is defined as”, “is called” and the like. The sentence is then parsed by identifying the context on the left hand side (LHS) of the cue phrase and the context on the right hand side (RHS) of the cue phrase (step 1180). If the left hand side is a noun phrase, then the noun phrase will be considered a term and the right hand side will be considered a definition for the term (step 1190). The <term, definition> pair can be added to a hash array and added to the pattern dictionary database 930.
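The cue phrase processing of steps 1170 through 1190 can be illustrated with the sketch below. The function names are hypothetical, and `naive_is_noun_phrase` is only a placeholder (it accepts any short span); in practice the noun phrase test would rely on the chunking output of step 1120.

```python
import re
from typing import Callable, Optional, Tuple

CUE_PHRASES = ("is the term used to describe", "is defined as", "is called")


def naive_is_noun_phrase(text: str) -> bool:
    """Placeholder test: treat any span of one to four words as a noun phrase."""
    return 0 < len(text.split()) <= 4


def extract_by_cue_phrase(
    sentence: str,
    is_noun_phrase: Callable[[str], bool] = naive_is_noun_phrase,
) -> Optional[Tuple[str, str]]:
    """Return (term, definition) if a cue phrase splits the sentence suitably."""
    for cue in CUE_PHRASES:
        match = re.search(re.escape(cue), sentence, re.IGNORECASE)
        if match is None:
            continue
        # Step 1180: context to the left and to the right of the cue phrase.
        lhs = sentence[: match.start()].strip()
        rhs = sentence[match.end():].strip().rstrip(".")
        # Step 1190: a noun phrase on the left is taken as the term.
        if rhs and is_noun_phrase(lhs):
            return lhs, rhs
    return None
```

A caller could add each returned pair to the same hash array used for the text marker patterns before storing it in the pattern dictionary database.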

The methods described herein are generally embodied in computer programs. The programming language and computer hardware on which the methods are performed are not critical to the present invention. It will be appreciated by those skilled in the art that such programs are embodied on computer readable media, such as optical or magnetic media, for example CD-ROMs, magnetic diskettes and the like. Such programs can also be distributed by downloading the programs over a digital data network.

The systems and methods described herein provide for the automatic generation of dictionary entries based on an analysis of full text materials. When a corpus of domain specific full text materials is provided, a domain specific dictionary, such as a dictionary of technical terms, can be generated. The dictionary can be dynamic, with new entries being added when additional full text materials are input to the system to extract <term, definition> pairs.

Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions and alterations can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.

1. A method for automatically generating a dictionary based on a corpus of full text articles, comprising: applying linguistic pattern analysis to the sentences in the corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs; applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs; storing the extracted <term, definition> pairs in a dictionary database; wherein the linguistic pattern analysis further comprises identifying sentences including text markers, and wherein the sentences including text markers are subjected to filtering to remove sentences not likely to include <term, definition> pairs, and wherein the filtering includes rules selected from the group including sentences with conjunctions at the beginning of a text marker, sentences having phrases indicative of explanation, and sentences having patterns indicative of enumeration.
 2. The method for automatically generating a dictionary according to claim 1, wherein pattern analysis of sentences including text markers identifies noun phrase patterns of the form <Noun Phrase 1> text marker <Noun Phrase 2> as <term, definition> pairs.
 3. The method for automatically generating a dictionary according to claim 2, wherein the frequencies of Noun Phrase 1 and Noun Phrase 2 are determined and the noun phrase having the higher frequency is considered the term and the other noun phrase is considered the definition for the term.
 4. The method for automatically generating a dictionary according to claim 2, wherein sentences with text markers having complex linguistic patterns are identified as sentences including candidate complex <term, definition> pairs.
 5. The method for automatically generating a dictionary according to claim 1, wherein sentences which do not include text markers, but include complex linguistic patterns are identified as including candidate complex <term, definition> pairs.
 6. The method for automatically generating a dictionary according to claim 1, wherein grammar analysis of sentences including candidate complex <term, definition> pairs includes at least one of processing of sentences exhibiting apposition and processing of sentences in the form <term> is <definition>.
 7. The method for automatically generating a dictionary according to claim 1, further comprising a bootstrap process which applies the <term, definition> pairs in the dictionary to the corpus to extract additional <term, definition> pairs.
 8. A method for automatically generating a dictionary based on a corpus of full text articles, comprising: applying linguistic pattern analysis to the sentences in the corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs; applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs; storing the extracted <term, definition> pairs in a dictionary database; wherein the linguistic pattern analysis further comprises identifying sentences including cue phrases and wherein sentences including predetermined cue phrases are parsed to identify left hand side context and right hand side context of the cue phrase, and wherein if the left hand side and right hand side contexts are noun phrases, then the left hand side context is considered the term and the right hand side context is considered the definition for the term.
 9. A method for automatically generating a dictionary based on a corpus of full text articles, comprising: applying linguistic pattern analysis to the sentences in the corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs; applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs; storing the extracted <term, definition> pairs in a dictionary database; wherein grammar analysis of sentences including candidate complex <term, definition> pairs includes processing of sentences exhibiting apposition wherein apposition processing includes: identifying the left conjunction and right conjunction in the sentence; and if the right conjunction is a noun phrase and does not include a further conjunction, the right conjunction is considered the definition and the left conjunction is considered the term of a <term, definition> pair.
 10. A method for automatically generating a dictionary based on a corpus of full text articles, comprising: applying linguistic pattern analysis to the sentences in the corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs; applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs; storing the extracted <term, definition> pairs in a dictionary database; wherein grammar analysis of sentences including candidate complex <term, definition> pairs includes processing of sentences in the form <term> is <definition> and wherein sentences of the form <term> is <definition> are further processed by the steps: if the root of the sentence is a present tense form of the verb “to be” and the subject of the sentence is a noun then evaluate the predicate of the sentence; evaluate daughters of subject to determine if daughters are left conjunctions and right conjunctions; if daughters are not in the form of left conjunctive and right conjunctive, evaluate predicate, and if the predicate of the sentence is a noun, the subject is considered the term and a subtree rooted at the predicate is considered the definition of the <term, definition> pair.
 11. The method for automatically generating a dictionary according to claim 10, wherein if the daughters are in the form of left conjunction and right conjunction, determine if daughters are both nouns; if daughters are both nouns, assign left conjunctive and right conjunctive as additional terms; and if predicate is a noun, then consider subtree rooted at the predicate as the definition for <term, definition> pairs for both left conjunction and right conjunction terms.
 12. A system for automatically generating a dictionary from full text articles comprising: a computer readable corpus having a plurality of documents therein; a computer readable dictionary database; a pattern processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in the dictionary database; a grammar processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in the dictionary database; and a routing processing module which routes sentences in the corpus to the pattern processing module; wherein the pattern processing module further comprises a filtering module to remove sentences not likely to include <term, definition> pairs and wherein the filtering module applies at least one rule selected from the group including sentences with conjunctions at the beginning of a text marker, sentences having phrases indicative of explanation, and sentences having patterns indicative of enumeration.
 13. The system for automatically generating a dictionary according to claim 12, further comprising a bootstrap processing module, the bootstrap processing module applying entries in the dictionary database to the corpus and extracting and storing additional <term, definition> pairs.
 14. The system for automatically generating a dictionary according to claim 12 wherein: the routing processing module tags sentences which may include <term, definition> pairs and routes all tagged sentences to the pattern processing module; the pattern processing module performs the additional operation of identifying sentences which include candidate complex <term, definition> pairs; and the grammar processing module receives the candidate complex <term, definition> pairs from the pattern processing module.
 15. The system for automatically generating a dictionary according to claim 12, wherein the pattern processing module identifies sentences including at least one of text markers and cue phrases as <term, definition> pairs.
 16. The system for automatically generating a dictionary according to claim 15, wherein the pattern processing module processes sentences including text markers by identifying noun phrase patterns of the form <Noun Phrase 1> text marker <Noun Phrase 2> as <term, definition> pairs.
 17. The system for automatically generating a dictionary according to claim 16, wherein the pattern processing module determines the frequency of occurrence of each of Noun Phrase 1 and Noun Phrase 2 and assigns the noun phrase having the higher frequency as the term and the other noun phrase as the definition for the term.
 18. The system for automatically generating a dictionary according to claim 16, wherein the pattern processing module identifies sentences with text markers having complex linguistic patterns as sentences including candidate complex <term, definition> pairs.
 19. The system for automatically generating a dictionary according to claim 15, wherein the pattern processing module identifies sentences which do not include text markers, but include complex linguistic patterns as including candidate complex <term, definition> pairs.
 20. The system for automatically generating a dictionary according to claim 12, wherein the grammar processing module extracts <term, definition> pairs from sentences including at least one of sentences exhibiting apposition and sentences in the form <term> is <definition>.
 21. A system for automatically generating a dictionary from full text articles comprising: a computer readable corpus having a plurality of documents therein; a computer readable dictionary database; a pattern processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in the dictionary database; a grammar processing module for extracting <term, definition> pairs from the corpus and storing the <term, definition> pairs in the dictionary database; and a routing processing module which routes sentences in the corpus to the pattern processing module, wherein the pattern processing module identifies sentences including cue phrases as <term, definition> pairs, and wherein the pattern processing module parses sentences including predetermined cue phrases to identify left hand side context and right hand side context of the cue phrase, and wherein if the left hand side and right hand side contexts are noun phrases, then the left hand side context is assigned as the term and the right hand side context is assigned as the definition for the term.
 22. Computer readable media encoded with instructions to direct a computer system to generate a dictionary based on a corpus of full text articles, comprising the steps of: applying linguistic pattern analysis to sentences in a corpus to extract <term, definition> pairs and identify sentences with candidate complex <term, definition> pairs; applying grammar analysis to the sentences with candidate complex <term, definition> pairs to extract <term, definition> pairs; and storing the extracted <term, definition> pairs in a dictionary database; wherein the linguistic pattern analysis further comprises identifying sentences including text markers and, wherein the sentences including text markers are subjected to filtering to remove sentences not likely to include <term, definition> pairs, and wherein the filtering includes rules selected from the group including sentences with conjunctions at the beginning of a text marker, sentences having phrases indicative of explanation, and sentences having patterns indicative of enumeration.